Personal selection of AI research articles

A collection of well-written articles that I enjoyed reading and that helped me grow as an AI professional

Photo by Janko Ferlic on Unsplash

Reading research articles is an important practice as an AI professional. Instead of providing an exhaustive list of seminal AI papers, I highlight a few that were important milestones in my own learning journey. These include some less-cited but well-written papers that significantly shaped my understanding of the subject. I aim to strike a balance across different fields of AI, although the computer vision section may become somewhat larger, as I have had significant experience in that subfield. I also mention papers adjacent to machine learning (e.g., articles on classical image processing) that may help clarify some foundational concepts.

Deep generative modeling

Kingma, D. P., & Welling, M., Auto-Encoding Variational Bayes (2013), Proceedings of the International Conference on Learning Representations (ICLR)

Generative modeling is an unsupervised form of machine learning in which the model learns the distribution of the input data, so that new samples resembling the data can be generated. Among deep generative models, two major families stand out and deserve special attention: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). VAEs are autoencoders that tackle the irregularity of the latent space of classical autoencoders by making the encoder return a distribution over the latent space instead of a single point. The distribution of the encodings is regularized during training so that the latent space is regular enough for new data to be generated by sampling from it. The loss function of VAEs is composed of a reconstruction term and a regularization term: the Kullback-Leibler divergence between the approximate posterior distribution of the latent variable given the input data and the prior latent distribution. This loss is derived using variational inference. The following articles by Joseph Rocca on Towards Data Science were helpful for understanding the paper: "Understanding Variational Autoencoders (VAEs)" and "Bayesian inference problem, MCMC and variational inference". I also found a Stack Exchange post about the calculation of the Kullback-Leibler divergence between two multivariate Gaussians useful.
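To make the two terms of the loss concrete, here is a minimal NumPy sketch. It assumes a diagonal Gaussian encoder output (`mu`, `log_var`), a standard normal prior, and a squared-error reconstruction term (which corresponds to a Gaussian decoder); the function names are my own and do not come from the paper.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so z stays differentiable in mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction term (squared error, an assumption) plus the KL regularization term."""
    reconstruction = np.sum((x - x_recon) ** 2, axis=-1)
    return np.mean(reconstruction + kl_to_standard_normal(mu, log_var))

rng = np.random.default_rng(0)
mu = np.array([[1.0, 0.0]])      # encoder mean for one input, 2 latent dimensions
log_var = np.zeros((1, 2))       # unit variance in both dimensions
z = reparameterize(mu, log_var, rng)
print(kl_to_standard_normal(mu, log_var))  # [0.5]
```

The KL term is zero only when the encoder outputs exactly the prior (zero mean, unit variance), which is what pulls the encodings toward a regular latent space.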

Computer Vision

Classical computer vision

David G. Lowe, Distinctive image features from scale-invariant keypoints (2004), International Journal of Computer Vision

This paper presents the scale-invariant feature transform (SIFT), a method to extract feature points and corresponding descriptors (or feature vectors) from images that are invariant to scale and rotation, and robust to affine distortion, 3D viewpoint change, noise, and changes in illumination. First, a scale-space pyramidal representation of the image is constructed by computing differences of Gaussians, and extrema corresponding to blobs are located in that space and refined by fitting a 3D quadratic function. Keypoints with low contrast and keypoints that are poorly localized along edges are eliminated. Each keypoint (extremum) is characterized by its 2D position, scale, and orientation, the latter derived from the local gradient orientation histogram. The local image descriptor is composed of binned gradient orientation histograms, measured relative to the keypoint orientation and computed for each box of a grid around the keypoint. I found that the videos of Pratik Jain about homography, image registration, the Harris corner detector and its properties, and the SIFT invariant features and feature descriptors were very helpful for understanding the context of the paper and its concepts.
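The orientation assignment step can be sketched in a few lines of NumPy. This is a simplified illustration, not Lowe's full procedure: it assumes a square patch already extracted around the keypoint, uses 36 bins of 10 degrees, and weights each gradient by its magnitude and a Gaussian window whose width (`sigma`) is my own default. Peak interpolation and secondary peaks are left out.

```python
import numpy as np

def dominant_orientation(patch, num_bins=36, sigma=None):
    """Return the dominant gradient orientation (degrees, image coordinates) and the histogram."""
    patch = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(patch)              # axis 0 is rows (y), axis 1 is columns (x)
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 360.0

    # Gaussian window centered on the patch (sigma is an assumed default)
    h, w = patch.shape
    if sigma is None:
        sigma = 0.5 * min(h, w)
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    window = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigma ** 2))

    # Accumulate weighted gradient magnitudes into orientation bins
    bin_width = 360.0 / num_bins
    bins = np.floor(angle / bin_width).astype(int) % num_bins
    hist = np.bincount(bins.ravel(), weights=(magnitude * window).ravel(),
                       minlength=num_bins)
    peak = int(np.argmax(hist))
    return (peak + 0.5) * bin_width, hist    # center of the winning bin

yy, xx = np.mgrid[0:9, 0:9]
ramp = xx + yy                               # intensity increases along the diagonal
orientation, hist = dominant_orientation(ramp)
print(orientation)  # 45.0
```

Because the descriptor is later expressed relative to this orientation, rotating the image rotates the dominant orientation by the same amount and leaves the descriptor approximately unchanged.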

Computer vision with deep learning

Jonathan Long et al., Fully convolutional networks for semantic segmentation (2015), Proceedings of the IEEE conference on computer vision and pattern recognition

This paper uses fully convolutional networks (FCN), that is, networks that only use convolutions and no fully connected layers, to perform semantic segmentation of natural images. In contrast to classical ConvNets for image classification, FCNs do not require a fixed input image size; they are also efficient, since the computation is shared across overlapping regions of the image instead of being repeated for each patch. 1×1 convolutions are used in place of fully connected layers to produce a coarse heat map where each channel represents a class, and “deconvolution” or upsampling layers are used to map features at the coarse level to larger 2D output segmentation results. Combining the information from features at different depths in the network using skip connections helps refine the dense output by adding localization information to more content-related features. I found the implementation in the d2l.ai book helpful for understanding the paper, although the skip layers were not implemented.
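The two building blocks described above can be illustrated with a small NumPy sketch: a 1×1 convolution that acts as the same linear classifier at every pixel, followed by a transposed convolution that upsamples each class map. The helper names and the constant upsampling kernel are assumptions made for illustration; the paper learns the upsampling filters and initializes them to bilinear interpolation.

```python
import numpy as np

def conv1x1(features, weight, bias):
    """1x1 convolution: features (C, H, W), weight (K, C), bias (K,) -> scores (K, H, W)."""
    return np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]

def conv_transpose2d(x, kernel, stride):
    """Single-channel transposed convolution without padding."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            # Each input value "paints" a scaled copy of the kernel into the output
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

def segment(features, weight, bias, stride):
    """Coarse class scores -> upsampled scores -> per-pixel label map."""
    scores = conv1x1(features, weight, bias)
    kernel = np.ones((stride, stride))       # assumption: behaves like nearest-neighbor upsampling
    upsampled = np.stack([conv_transpose2d(s, kernel, stride) for s in scores])
    return upsampled.argmax(axis=0)

rng = np.random.default_rng(0)
weight, bias = rng.standard_normal((3, 8)), np.zeros(3)   # 8 feature channels, 3 classes
for h, w in [(4, 5), (6, 2)]:                              # any input size works
    labels = segment(rng.standard_normal((8, h, w)), weight, bias, stride=4)
    print(labels.shape)  # (16, 20) then (24, 8)
```

Nothing in the pipeline depends on the spatial size of the feature map, which is why the same weights handle both inputs.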

Olaf Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation (2015), International Conference on Medical image computing and computer-assisted intervention

U-Net is a neural network for image segmentation that uses relatively symmetric contracting and expanding paths (hence the U shape) with skip connections. That architecture leads to a good balance between localization accuracy and the use of context, while keeping the computing cost low. It builds on the FCN architecture; one important difference is that U-Net keeps a large number of feature channels in the up-sampling part, which lets the network propagate context information to the higher-resolution layers.
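A single U-Net skip connection can be sketched as follows. This is a shape-level illustration only: convolutions are omitted, nearest-neighbor upsampling stands in for the learned up-convolution, and the feature maps are assumed to have matching sizes so that no cropping is needed (the original architecture crops the encoder features). The helper names are hypothetical.

```python
import numpy as np

def max_pool2x(x):
    """2x2 max pooling on a (C, H, W) array with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbor upsampling by a factor of 2 (stand-in for an up-convolution)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def unet_skip(encoder_features, decoder_features):
    """Concatenate encoder features with upsampled decoder features along the channel axis."""
    return np.concatenate([encoder_features, upsample2x(decoder_features)], axis=0)

enc = np.random.default_rng(0).standard_normal((64, 32, 32))  # contracting-path features
coarse = max_pool2x(enc)                                       # (64, 16, 16), deeper level
merged = unet_skip(enc, coarse)
print(merged.shape)  # (128, 32, 32)
```

Concatenation, rather than the element-wise sum used in FCN skip layers, is what gives the expanding path its large channel count.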

Tsung-Yi Lin et al., Feature Pyramid Networks for Object Detection (2017), Proceedings of the IEEE conference on computer vision and pattern recognition

This paper introduces feature pyramid networks (FPN), a framework that exploits the inherent multi-scale pyramidal hierarchy of ConvNets, which combines low-resolution but semantically strong features with high-resolution but semantically weak features, to construct a feature pyramid that has strong semantics at all scales. The bottom-up pathway of the FPN is the feed-forward computation of the backbone ConvNet, computing feature maps at several scales. The subsequent top-down pathway upsamples the feature map from the highest level, while lateral connections merge it at each level with the bottom-up feature map of the same spatial size via element-wise addition. Prediction is performed at each scale of the top-down pathway. The authors adapt Region Proposal Network (RPN) and Fast R-CNN to the FPN framework, for bounding box proposal generation and object detection respectively, and also report strong results on segmentation proposals.
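The top-down pathway with lateral connections can be sketched in NumPy as below. It is a simplified illustration under stated assumptions: random weights, nearest-neighbor upsampling by 2, and no 3×3 smoothing convolution after each merge (the paper applies one). The function names and channel sizes are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: x (C, H, W), weight (D, C) -> (D, H, W)."""
    return np.einsum("dc,chw->dhw", weight, x)

def upsample2x(x):
    """Nearest-neighbor upsampling by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(bottom_up, lateral_weights):
    """bottom_up: feature maps ordered fine to coarse. Returns the pyramid in the same order."""
    # Start from the coarsest, semantically strongest map
    p = conv1x1(bottom_up[-1], lateral_weights[-1])
    pyramid = [p]
    for c, w in zip(reversed(bottom_up[:-1]), reversed(lateral_weights[:-1])):
        # Upsample the coarser level and add the lateral projection of the finer map
        p = upsample2x(p) + conv1x1(c, w)
        pyramid.append(p)
    return pyramid[::-1]

rng = np.random.default_rng(0)
channels, sizes, d = [8, 16, 32, 64], [32, 16, 8, 4], 16
bottom_up = [rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)]
lateral_weights = [rng.standard_normal((d, c)) for c in channels]
pyramid = fpn_top_down(bottom_up, lateral_weights)
print([p.shape for p in pyramid])
# [(16, 32, 32), (16, 16, 16), (16, 8, 8), (16, 4, 4)]
```

Every level ends up with the same number of channels, so a single prediction head can be shared across all scales.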

Ting Chen et al., A Simple Framework for Contrastive Learning of Visual Representations (2020), International Conference on Machine Learning

Unsupervised representation learning is successful in natural language processing, but supervised pre-training still prevails in computer vision. Furthermore, constructing large-scale labeled datasets is a difficult task, so self-supervised learning could be useful in computer vision. In the SimCLR method, each image is augmented twice using random cropping and resizing, color distortion, and Gaussian blur; the two augmented views of the same image form a positive pair, while the views of the other images in the batch serve as negatives. SimCLR learns two functions, f and g: f is the representation encoder, and g is the "projection head". The contrastive loss function, a normalized temperature-scaled cross entropy loss (NT-Xent), is minimized on the output of g ∘ f. One key finding is that unsupervised contrastive learning seems to benefit more than supervised learning from stronger data augmentation and from scaling up (larger models, larger batch sizes, and longer training).
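The contrastive loss can be written compactly in NumPy. The sketch below assumes that `z1[i]` and `z2[i]` are the projection-head outputs for the two augmented views of image `i`, that no embedding is the zero vector, and that the temperature default is arbitrary; the function name is my own.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """Normalized temperature-scaled cross entropy over a batch of N positive pairs."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = z @ z.T / temperature                         # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                      # a view is never its own candidate

    # Row i's positive is the other view of the same image
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over the 2N - 1 remaining candidates
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), positives] - log_norm
    return -log_prob.mean()

views = np.eye(3)                                       # three mutually orthogonal embeddings
aligned = nt_xent_loss(views, views)                    # positives agree perfectly
shuffled = nt_xent_loss(views, np.roll(views, 1, axis=0))  # positives point at the wrong image
print(aligned < shuffled)  # True
```

Each row is a classification problem over the other 2N - 1 views in the batch, which gives some intuition for why larger batches, and therefore more negatives, help.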

Published on January 5, 2023, last updated on May 17, 2026